Final Project - Indicators of Anxiety or Depression Based on Reported Frequency of Symptoms During Last 7 Days

Author

Ian Walsh & Logan Rosell

Published

November 12, 2025

Research Question: How did anxiety and depression levels differ between states and regions following the outbreak of COVID-19 in the United States?

Data Cleaning

Import libraries and dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

df = pd.read_csv("./Datasets/Indicators_of_Anxiety_or_Depression_Based_on_Reported_Frequency_of_Symptoms_During_Last_7_Days.csv")

df.head()
| Indicator | Group | State | Subgroup | Phase | Time Period | Time Period Label | Time Period Start Date | Time Period End Date | Value | Low CI | High CI | Confidence Interval | Quartile Range
0 | Symptoms of Depressive Disorder | National Estimate | United States | United States | 1 | 1 | Apr 23 - May 5, 2020 | 04/23/2020 | 05/05/2020 | 23.5 | 22.7 | 24.3 | 22.7 - 24.3 | NaN
1 | Symptoms of Depressive Disorder | By Age | United States | 18 - 29 years | 1 | 1 | Apr 23 - May 5, 2020 | 04/23/2020 | 05/05/2020 | 32.7 | 30.2 | 35.2 | 30.2 - 35.2 | NaN
2 | Symptoms of Depressive Disorder | By Age | United States | 30 - 39 years | 1 | 1 | Apr 23 - May 5, 2020 | 04/23/2020 | 05/05/2020 | 25.7 | 24.1 | 27.3 | 24.1 - 27.3 | NaN
3 | Symptoms of Depressive Disorder | By Age | United States | 40 - 49 years | 1 | 1 | Apr 23 - May 5, 2020 | 04/23/2020 | 05/05/2020 | 24.8 | 23.3 | 26.2 | 23.3 - 26.2 | NaN
4 | Symptoms of Depressive Disorder | By Age | United States | 50 - 59 years | 1 | 1 | Apr 23 - May 5, 2020 | 04/23/2020 | 05/05/2020 | 23.2 | 21.5 | 25.0 | 21.5 - 25.0 | NaN

Rename the “Value” column to “Percent of Population” to more accurately describe what the data measures

df.rename(columns={"Value": "Percent of Population"}, inplace = True)

Filter to state-level data only and drop unnecessary columns: Group and Subgroup are redundant once the data is filtered, and Time Period Label and Confidence Interval only combine data from other columns.

# .copy() gives an independent frame, so later column assignments do not act on a slice of df
state_data = df[df['Group'] == 'By State'].copy()

state_data = state_data.drop(columns=['Group', 'Subgroup', 'Time Period Label', 'Confidence Interval'])

Separate Quartile Range into two columns:

state_data[['Quartile_Lower', 'Quartile_Upper']] = state_data['Quartile Range'].str.split(' - ', expand=True)
# Assign the result so the drop persists, then display the frame
state_data = state_data.drop(columns='Quartile Range')
state_data
| Indicator | State | Phase | Time Period | Time Period Start Date | Time Period End Date | Percent of Population | Low CI | High CI | Quartile_Lower | Quartile_Upper
19 | Symptoms of Depressive Disorder | Alabama | 1 | 1 | 04/23/2020 | 05/05/2020 | 18.6 | 14.6 | 23.1 | 16.5 | 20.7
20 | Symptoms of Depressive Disorder | Alaska | 1 | 1 | 04/23/2020 | 05/05/2020 | 19.2 | 16.8 | 21.8 | 16.5 | 20.7
21 | Symptoms of Depressive Disorder | Arizona | 1 | 1 | 04/23/2020 | 05/05/2020 | 22.4 | 19.4 | 25.5 | 22.2 | 24.0
22 | Symptoms of Depressive Disorder | Arkansas | 1 | 1 | 04/23/2020 | 05/05/2020 | 26.6 | 22.3 | 31.3 | 24.1 | 28.7
23 | Symptoms of Depressive Disorder | California | 1 | 1 | 04/23/2020 | 05/05/2020 | 25.4 | 22.5 | 28.6 | 24.1 | 28.7
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
14368 | Symptoms of Anxiety Disorder or Depressive Dis... | Virginia | 3.10 | 62 | 09/20/2023 | 10/02/2023 | 33.2 | 29.8 | 36.7 | 30.7-33.5 | None
14369 | Symptoms of Anxiety Disorder or Depressive Dis... | Washington | 3.10 | 62 | 09/20/2023 | 10/02/2023 | 34.3 | 31.0 | 37.7 | 33.6-36.2 | None
14370 | Symptoms of Anxiety Disorder or Depressive Dis... | West Virginia | 3.10 | 62 | 09/20/2023 | 10/02/2023 | 44.7 | 40.0 | 49.4 | 36.3-44.7 | None
14371 | Symptoms of Anxiety Disorder or Depressive Dis... | Wisconsin | 3.10 | 62 | 09/20/2023 | 10/02/2023 | 30.4 | 27.1 | 33.9 | 24.5-30.6 | None
14372 | Symptoms of Anxiety Disorder or Depressive Dis... | Wyoming | 3.10 | 62 | 09/20/2023 | 10/02/2023 | 36.3 | 30.1 | 42.9 | 36.3-44.7 | None

9486 rows × 11 columns
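
The last rows of the output above store the range without spaces around the hyphen (for example 30.7-33.5), so splitting on ' - ' leaves the whole string in Quartile_Lower and None in Quartile_Upper. One possible fix, shown here as a minimal sketch on a made-up three-row sample rather than the real dataset, is to split on a hyphen with optional whitespace and convert both parts to numbers. The sample frame and the bounds variable are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the two formats visible in the output above
sample = pd.DataFrame({"Quartile Range": ["16.5 - 20.7", "30.7-33.5", None]})

# Split on a hyphen with optional surrounding whitespace (at most one split)
bounds = sample["Quartile Range"].str.split(r"\s*-\s*", n=1, expand=True, regex=True)

# Convert the pieces to floats; missing ranges stay missing
sample["Quartile_Lower"] = pd.to_numeric(bounds[0])
sample["Quartile_Upper"] = pd.to_numeric(bounds[1])
```

Applied to state_data, the same pattern would fill Quartile_Upper for the later time periods as well, assuming no other range formats appear in the column.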

Clean up the Phase column:

state_data['Phase'].unique()
# There are 2 values that contain dates which are already stored in other columns, so we can remove these dates

state_data['Phase'] = state_data['Phase'].str.split(' ', expand = True).get(0)

Add a column for the region of the United States, based on the US Census Bureau's Census Regions and Divisions of the United States
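
No code accompanies this step in the text, so the following is only a sketch of how such a column might be built. It groups the 50 states and the District of Columbia into the four Census regions (Northeast, Midwest, South, West) and maps state names to them. The names census_regions, state_to_region, example, and the Region column are assumptions made for illustration.

```python
import pandas as pd

# Four Census Bureau regions; dictionary and column names are illustrative
census_regions = {
    "Northeast": ["Connecticut", "Maine", "Massachusetts", "New Hampshire",
                  "Rhode Island", "Vermont", "New Jersey", "New York",
                  "Pennsylvania"],
    "Midwest": ["Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin",
                "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska",
                "North Dakota", "South Dakota"],
    "South": ["Delaware", "District of Columbia", "Florida", "Georgia",
              "Maryland", "North Carolina", "South Carolina", "Virginia",
              "West Virginia", "Alabama", "Kentucky", "Mississippi",
              "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas"],
    "West": ["Arizona", "Colorado", "Idaho", "Montana", "Nevada",
             "New Mexico", "Utah", "Wyoming", "Alaska", "California",
             "Hawaii", "Oregon", "Washington"],
}

# Invert to a state -> region lookup
state_to_region = {
    state: region
    for region, states in census_regions.items()
    for state in states
}

# Small stand-in frame; in the notebook this would be state_data
example = pd.DataFrame({"State": ["Alabama", "Wyoming", "Vermont", "Ohio"]})
example["Region"] = example["State"].map(state_to_region)
```

In the notebook, the same lookup could be applied as state_data['State'].map(state_to_region); any state name missing from the lookup would come back as NaN, which makes gaps easy to spot.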

Change Data Types as needed

state_data['Indicator'] = pd.Categorical(state_data['Indicator'], categories = ['Symptoms of Depressive Disorder', 'Symptoms of Anxiety Disorder', 'Symptoms of Anxiety Disorder or Depressive Disorder'])

state_data['Phase'] = pd.Categorical(state_data['Phase'], categories=['1', '2', '3', '3.1', '3.2', '3.3', '3.4', '3.5', '3.6', '3.7', '3.8', '3.9', '3.10'])

state_data['Time Period Start Date'] = pd.to_datetime(state_data['Time Period Start Date']).dt.date
state_data['Time Period End Date'] = pd.to_datetime(state_data['Time Period End Date']).dt.date
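
The explicit category list for Phase matters because the phase labels are strings: sorted as plain text, '3.10' lands before '3.2'. The sketch below illustrates this on a made-up series. It also passes ordered=True, which the code above does not, to show that comparisons between phases then become possible; whether the analysis needs that is an assumption.

```python
import pandas as pd

phase_order = ['1', '2', '3', '3.1', '3.2', '3.3', '3.4', '3.5',
               '3.6', '3.7', '3.8', '3.9', '3.10']

# Hypothetical handful of phase labels
phases = pd.Series(['3.10', '3.2', '1', '3.9'])

# Plain string sorting is lexical, so '3.10' sorts before '3.2'
lexical = phases.sort_values().tolist()

# Sorting a categorical follows the category order instead
phase_cat = pd.Series(pd.Categorical(phases, categories=phase_order, ordered=True))
by_phase = phase_cat.sort_values().astype(str).tolist()

# With ordered=True, phases can also be compared directly
after_3_9 = (phase_cat > '3.9').tolist()
```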

EDA

Looking at some graphs

# Histogram of values for all states
plt1 = sns.histplot(state_data, x='Percent of Population', hue = 'Indicator', alpha = 0.5)
plt.title('Histogram of Percent of Population by Indicator')
plt.show()

# Pair Plot
pair_plot = sns.pairplot(state_data, hue = 'Indicator')
plt.show()

# Unweighted mean across states for each time period and indicator
national_avgs = (
    state_data
    .groupby(['Time Period Start Date', 'Indicator'], observed=False)
    .agg(nat_means=('Percent of Population', 'mean'))
    .reset_index()  # turn the group keys back into columns for plotting
)
nat_avg_plt = sns.lineplot(national_avgs,
                            x='Time Period Start Date',
                            y='nat_means',
                            hue='Indicator')
plt.xticks(rotation=45)
plt.title("Percent of Population Over Time")
plt.ylabel('Percent of Population')
plt.show()

state_code_map = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "Virgin Islands, U.S.": "VI",
}

indicators = state_data['Indicator'].unique()
color_scales = ['blues','amp','purp']

for i,j in zip(indicators,color_scales):
    # Copy the subset so the new column is not written to a slice of state_data
    fig_data = state_data[state_data['Indicator'] == i].copy()

    fig_data['State_Code'] = fig_data['State'].map(state_code_map)

    # Fix the color range across all frames; avoid shadowing the built-in min/max
    pct_max = fig_data['Percent of Population'].max()
    pct_min = fig_data['Percent of Population'].min()

    fig = px.choropleth(
        fig_data,
        locations='State_Code',
        locationmode='USA-states',
        color='Percent of Population',
        scope='usa',
        title=f'Map of {i} in US states',
        hover_name='State',
        color_continuous_scale=j,
        animation_frame='Time Period Start Date',
        range_color=[pct_min, pct_max]
    )
    fig.show()

Linear Modeling